Analysis of how chemical properties influence the quality of red wines by Meixian Chen

Introduction: This report uses R to quantitatively analyze how chemical properties influence the quality of red wines. The tidy wine data set consists of 1599 red wines with 11 variables describing their chemical properties. Each wine was rated by at least 3 wine experts on a scale from 0 (very bad) to 10 (very excellent).

Overview of the wine dataset

Here is the structure of the dataset. It contains 1599 observations of 11 chemical properties plus a quality rating for each wine; the X column is only a row index.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
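The summary further down does not list the X column, so it was presumably dropped before summarizing. A minimal sketch of that step, using a hypothetical three-row stand-in (`red_demo`) rather than the real data frame:

```r
# Hypothetical three-row stand-in; values copied from the listing above.
red_demo <- data.frame(
  X       = 1:3,
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5L, 5L, 5L)
)

# Drop the row-index column so it is not treated as a measurement.
red_demo$X <- NULL

str(red_demo)
```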

Summary statistics for each variable

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Univariate Plots Section

Visual distribution of each variable

The distribution of wine quality is close to a normal distribution. Most wines are rated as average (5 or 6); there are only a few bad wines (rated below 5) and a few good wines (rated 7 or higher).
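The counts behind such a statement can be tabulated directly. The sketch below uses a hypothetical vector of ratings, and the three bands (bad: 4 or lower, average: 5 or 6, good: 7 or higher) are an assumed reading of the categories in the text:

```r
# Hypothetical ratings; the real data has 1599 values in red$quality.
quality_demo <- c(5, 5, 6, 6, 6, 5, 7, 4, 5, 6, 8, 3)

# Count wines per rating.
counts <- table(quality_demo)

# Group ratings into three bands: (-Inf, 4], (4, 6], (6, Inf).
band <- cut(quality_demo,
            breaks = c(-Inf, 4, 6, Inf),
            labels = c("bad", "average", "good"))
band_counts <- table(band)

print(counts)
print(band_counts)
```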

We continue to inspect the distribution of each of the 11 variables.

The distribution of fixed.acidity is close to a normal distribution. Fixed acidity values above 12.5 are considered outliers.
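The report does not say how the 12.5 cutoff was chosen. One common convention, assumed here purely for illustration, is the boxplot rule that flags points more than 1.5 times the interquartile range beyond the quartiles. Applied to the fixed.acidity quartiles from the summary table, it gives an upper fence close to that cutoff:

```r
# Quartiles of fixed.acidity taken from the summary table above.
q1 <- 7.10
q3 <- 9.20

# Boxplot convention (an assumption, not necessarily the report's method):
# points beyond 1.5 * IQR from the quartiles count as outliers.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))
```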

I remove the outlier with a volatile acidity close to 1.6 from the data frame for the further analysis.

Unlike the two variables above, the distribution of citric.acid peaks at 0, and the counts decrease as citric.acid increases. I remove the outlier whose value is 1.00.

Most residual.sugar values lie between 0 and 4. The distribution has a long, thin right tail.

The distribution of chlorides is very similar to that of residual.sugar, and both have a long tail of outliers. Do the outliers in the two plots come from the same observations?

The next two boxplots show chlorides for the wines whose residual.sugar level is below 3, and residual.sugar for the wines whose chlorides level is below 1.2.

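library(ggplot2)    # ggplot, geom_jitter, geom_boxplot
library(gridExtra)  # grid.arrange

# Left: chlorides for wines with residual.sugar < 3.
# Right: residual.sugar for wines with chlorides < 1.2. Note that chlorides
# never exceeds 0.611 in this data, so this second filter keeps every row.
grid.arrange(
  ggplot(red[red$residual.sugar < 3, ], aes(x = 1, y = chlorides)) +
    geom_jitter(alpha = 0.1) +
    geom_boxplot(alpha = 0.2, color = 'red'),
  ggplot(red[red$chlorides < 1.2, ], aes(x = 1, y = residual.sugar)) +
    geom_jitter(alpha = 0.1) +
    geom_boxplot(alpha = 0.2, color = 'red'),
  ncol = 2)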

If the outliers came from the same observations, we would expect these plots to differ from the original ones. Since there is no obvious change in the new plots, the outliers come from different observations. Moreover, the correlation cor(red$residual.sugar, red$chlorides) ≈ 0.054 is rather low. Similar distributions of two variables do not mean that the variables are strongly correlated.

cor(red$residual.sugar,red$chlorides)
## [1] 0.05371106
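The overlap question can also be checked directly by comparing row indices instead of comparing plots. This is a minimal sketch on a hypothetical stand-in data frame; the cutoffs 4 and 0.2 are illustrative assumptions, not values from the report:

```r
# Hypothetical stand-in for the wine data; only the two columns of interest.
red_demo <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 15.5, 1.8, 2.0, 9.0, 2.2),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.611, 0.069, 0.080, 0.415)
)

# Illustrative cutoffs (assumptions) marking the tail of each variable.
sugar_out    <- which(red_demo$residual.sugar > 4)
chloride_out <- which(red_demo$chlorides > 0.2)

# Rows flagged in both variables; an empty result means the two tails
# come from different wines.
shared <- intersect(sugar_out, chloride_out)

print(sugar_out)
print(chloride_out)
print(length(shared))
```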

I remove the total.sulfur.dioxide outliers with values > 200 (the only variable whose range reaches that high) for the further analysis.

The density of the red wines falls in a narrow range, between 0.99 g/ml and 1.01 g/ml (pure water is 1.00 g/ml). This matches what I expected.

Almost all wines have a pH between 2.9 and 4.

I remove outliers with sulphates values > 1.5.

The alcohol level of the red wines lies almost entirely between 8% and 14% by volume.
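Threshold-based removal like this can be written as a single filter. The sketch below uses a hypothetical stand-in data frame and assumes that the > 200 cutoff mentioned earlier applies to total.sulfur.dioxide, the only variable in the summary whose range reaches that high:

```r
# Hypothetical stand-in rows; the real data frame in the report is `red`.
red_demo <- data.frame(
  total.sulfur.dioxide = c(34, 67, 289, 60, 40),
  sulphates            = c(0.56, 0.68, 0.65, 2.00, 0.46)
)

# Keep only rows at or below both cutoffs mentioned in the text.
red_clean <- subset(red_demo,
                    total.sulfur.dioxide <= 200 & sulphates <= 1.5)

# Number of rows dropped by the filter.
print(nrow(red_demo) - nrow(red_clean))
```

Filtering into a new object rather than overwriting the original keeps the unfiltered data available for comparison.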

Univariate Analysis

The distributions of most variables are close to normal: the majority of observations fall in the middle bins and fewer fall in the low and high bins.

We are mainly interested in which chemical properties affect wine quality, and in how strongly they relate to each other. Here is a matrix presenting the correlation values among the variables. Red dots show negative correlations and blue dots show positive correlations. The darkness and the size of a dot show how strong the correlation is.

According to the correlation matrix, quality is mainly related to the alcohol and volatile.acidity levels, and then to the citric.acid and sulphates levels.
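The ranking of variables by their correlation with quality can be read off numerically as well. A minimal sketch with base R's `cor`, on a hypothetical six-row stand-in (the values are made up and only illustrate the mechanics):

```r
# Hypothetical stand-in rows (not real measurements).
red_demo <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.45, 0.40, 0.30),
  citric.acid      = c(0.00, 0.04, 0.10, 0.36, 0.30, 0.45),
  quality          = c(5, 4, 5, 6, 7, 8)
)

# Pearson correlation matrix of all columns.
cor_mat <- cor(red_demo)

# Correlations with quality, strongest first by absolute value.
with_quality <- cor_mat["quality", colnames(cor_mat) != "quality"]
ranked <- with_quality[order(abs(with_quality), decreasing = TRUE)]

print(round(ranked, 2))
```

Sorting by absolute value keeps strong negative relationships, such as the one with volatile acidity, near the top of the list.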

Bivariate Plots and Analysis Section

We plot quality against the chemical properties that are most strongly related to it.

Plotting alcohol against quality shows that better red wines tend to have higher alcohol levels. For average and good wines (rated higher than 4), the mean alcohol level increases with the quality rating.
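The per-rating means behind this observation can be computed with `tapply`. The sketch uses a hypothetical stand-in data frame, so the numbers only illustrate the calculation:

```r
# Hypothetical stand-in rows (not real measurements).
red_demo <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.4, 10.8, 11.5, 11.9)
)

# Mean alcohol level within each quality rating.
mean_alcohol <- tapply(red_demo$alcohol, red_demo$quality, mean)

print(mean_alcohol)
```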

Good wine tends to have low volatile acidity.

The third chemical property we plot is citric.acid. It ranks third or fourth among the variables related to quality, and it is also strongly related to volatile.acidity. We first show a bivariate plot of citric.acid and quality; the next section investigates the multivariate relationship among quality, citric.acid and volatile.acidity.

Good wine tends to have higher citric acid.

Multivariate Plots and Analysis Section

The first multivariate plot is a scatter plot that uses the two properties most strongly related to quality, alcohol and volatile acidity, as axes, and colors the points by quality to highlight how the different categories of wine are distributed. The majority of good wines lie in the zone of higher alcohol and lower volatile acidity, while the bad wines show the opposite.

The second multivariate plot uses citric acid and volatile acidity as axes, again with quality as the color. The majority of good wines lie in the zone of higher citric acid and lower volatile acidity, while the bad wines show the opposite. More interestingly, wines with a higher citric acid level also tend to have a lower volatile acidity level.

The last plot we present is a 3D figure. To get a clear picture, we select only the very good wines (rated 8, the maximum) and the very bad wines (rated 4 at most) from the dataset. The interesting finding is that good wines usually have lower volatile acidity, higher citric acid and a higher alcohol level, while bad wines show the opposite.
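The same contrast can be summarized numerically by keeping only the extreme ratings and averaging each property per group. This is a sketch on a hypothetical stand-in data frame; the selection rule (ratings of 4 or lower versus ratings of 8) is an assumed reading of the text:

```r
# Hypothetical stand-in rows (not real measurements).
red_demo <- data.frame(
  quality          = c(3, 4, 5, 6, 8, 8),
  volatile.acidity = c(0.90, 0.80, 0.60, 0.50, 0.30, 0.40),
  citric.acid      = c(0.05, 0.10, 0.20, 0.30, 0.45, 0.40),
  alcohol          = c(9.5, 10.0, 10.0, 10.5, 12.5, 12.0)
)

# Keep only the extremes (assumed rule: rating <= 4 or rating >= 8).
extremes <- subset(red_demo, quality <= 4 | quality >= 8)
extremes$group <- ifelse(extremes$quality >= 8, "very good", "very bad")

# Average each property within the two groups.
group_means <- aggregate(cbind(volatile.acidity, citric.acid, alcohol) ~ group,
                         data = extremes, FUN = mean)

print(group_means)
```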

Final Plots and Summary

To summarize this report, we select three representative plots: a univariate plot showing the distribution of wine quality, a bivariate plot showing how one chemical property influences wine quality, and a multivariate plot showing how chemical properties influence wine quality and how they relate to each other.

Plot One

The first plot gives an overview of the wine quality distribution. It is close to a normal distribution, as we would expect: the majority of wines on the market are average, while very good and very bad wines are few.

Plot Two

The second plot shows how alcohol, one of the chemical properties most strongly related to quality, affects wine quality. In general, good wines tend to have a high alcohol level.

Plot Three

The third plot shows how citric acid and volatile acidity influence wine quality. These two variables are also related to each other. The majority of good wines lie in the zone of higher citric acid and lower volatile acidity, while the bad wines show the opposite. More interestingly, wines with a higher citric acid level also tend to have a lower volatile acidity level.

Reflection

The most difficult part of starting the analysis was deciding which variable to begin with. I wasted some time at the beginning trying to plot insightful figures. Generating a correlation matrix among the variables, or using the ggpairs function (GGally package) to analyze a subset of the dataset before the bivariate and multivariate plotting, is very useful. The analysis went more smoothly once I picked the right combination of variables. I also struggled to generate clear figures that show the patterns in the data, especially in the multivariate plots. After adopting the suggestion to draw a linear regression line for each category of wine quality, the patterns became easier to discover.

Wine tasting is a personal thing. Some people prefer certain kinds of wine, while others have different opinions. One direction for future work on this dataset would be to store the ratings from the different experts separately, and then to identify the chemical properties that good wines have in common for each individual. We could then recommend good wines based on similar personal preferences, rather than on the average opinion of a few experts.